Toronto is the capital city of the Canadian province of Ontario. With a recorded population of approximately 2.7 million in 2016, it is the most populous city in Canada and the fourth most populous city in North America. The diverse population of Toronto reflects its current and historical role as an important destination for immigrants to Canada. More than 50 percent of residents belong to a visible minority population group, and over 200 distinct ethnic origins are represented among its inhabitants. Toronto is an international centre of business, finance, arts, and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world. Toronto covers an area of 630 square kilometres (243 sq mi), with a maximum north–south distance of 21 km (13 mi). It has a maximum east–west distance of 43 km (27 mi) and it has a 46-kilometre (29 mi) long waterfront shoreline, on the northwestern shore of Lake Ontario.
Toronto encompasses a geographical area formerly administered by many separate municipalities. These municipalities have each developed a distinct history and identity over the years, and their names remain in common use among Torontonians. Former municipalities include East York, Etobicoke, Forest Hill, Mimico, North York, Parkdale, Scarborough, Swansea, Weston and York. Throughout the city there exist hundreds of small neighbourhoods and some larger neighbourhoods covering a few square kilometres.
The objective of this project is to analyze and select the best locations in the city of Toronto, Canada, to open a new bakery. Using data science methodology and tools such as data analysis and data visualization, the project aims to provide new insights into the stated business problem.
To proceed with the research, we will use the following data sources:
# import libraries for data
import pandas as pd
import numpy as np
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
# map rendering library
#pip install folium # uncomment this line to install Folium
import folium
# library to handle requests
import requests
# defining Toronto coordinate to initiate map later on
toronto_geolocator = Nominatim(user_agent="toronto_explorer")
toronto_address = 'Toronto, Ontario'
toronto_location = toronto_geolocator.geocode(toronto_address)
toronto_latitude = toronto_location.latitude
toronto_longitude = toronto_location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))
# import JSON file with Toronto venues from previous task
table = pd.read_json(r'toronto_venues.json')
table.head()
We need to segment all the venues to find our competitors, so the next step is to filter out all the bakeries. After exploring the venue categories, we will also treat bagel shops, creperies, cupcake shops, donut shops, pastry shops, pie shops, and sandwich places as competitors. It is manual work, but it still has to be done.
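Before hand-picking the list, it can help to scan the distinct categories for likely competitors; a minimal sketch with a toy venue table (column names assumed to match `toronto_venues.json`):

```python
import pandas as pd

# Toy stand-in for the venue table; the real data comes from toronto_venues.json
table = pd.DataFrame({
    'Venue': ['Sweet Crumbs', 'Joe Coffee', 'Bagel Barn', 'Lakeview Park'],
    'Venue Category': ['Bakery', 'Coffee Shop', 'Bagel Shop', 'Park'],
})

# List every distinct category, then flag the ones matching bakery-like keywords
categories = sorted(table['Venue Category'].unique())
keywords = ('Bakery', 'Bagel', 'Creperie', 'Cupcake', 'Donut',
            'Pastry', 'Pie', 'Sandwich')
candidates = [c for c in categories if any(k in c for k in keywords)]
print(candidates)  # candidate competitor categories to review by hand
```

The keyword scan only produces candidates; the final list still needs a manual review, since a keyword match does not guarantee a category is really a competitor.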
# making list of competitors
competitors = ['Bakery',
               'Bagel Shop',
               'Creperie',
               'Cupcake Shop',
               'Donut Shop',
               'Pastry Shop',
               'Pie Shop',
               'Sandwich Place']
Now let's filter the full venue table down to competitors only.
# filtering by list of competitors
toronto_venues = table.copy()
toronto_venues_competitors = toronto_venues[toronto_venues['Venue Category'].isin(competitors)]
Let's visualize all the competitors on a map.
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=14)
# adding competitors to the map
for lat, lng, category, venue in zip(toronto_venues_competitors['Venue Latitude'],
                                     toronto_venues_competitors['Venue Longitude'],
                                     toronto_venues_competitors['Venue Category'],
                                     toronto_venues_competitors['Venue']):
    label = folium.Popup('{}, {}'.format(category, venue), parse_html=True)
    folium.Marker(
        [lat, lng],
        popup=label).add_to(map_toronto)
map_toronto
To proceed, let's count the venues in each postal code. This helps us identify the more popular areas, where the foot traffic is.
# count venues in every Postal Code
toronto_venues_count = toronto_venues.groupby('Postal Code').count().sort_values(by='Venue', ascending=False)
# count competitors in every Postal Code
toronto_venues_competitors_count = toronto_venues_competitors.groupby('Postal Code').count().sort_values(by='Venue', ascending=False)
# adding number of competitors to previous table
toronto_venues_count['Number of Competitors'] = toronto_venues_competitors_count['Venue']
# replace NaN with 0
toronto_venues_count['Number of Competitors'].fillna(0, inplace=True)
# making column as integer data type for consistency
toronto_venues_count['Number of Competitors'] = toronto_venues_count['Number of Competitors'].astype(int)
toronto_venues_count['Percent of Competitors'] = round(toronto_venues_count['Number of Competitors'] / toronto_venues_count['Venue'] * 100, 2)
# delete unnecessary columns
toronto_venues_count = toronto_venues_count[['Venue', 'Number of Competitors', 'Percent of Competitors']]
print(toronto_venues_count.shape)
toronto_venues_count.head()
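The competitor counts were attached to the venue counts purely by index alignment: postal codes with no competitors come through as NaN and are then filled with 0. A toy illustration of that mechanism, with made-up postal codes and counts:

```python
import pandas as pd

# Toy stand-ins: venue counts for three postal codes, competitor counts for one
venues = pd.Series([10, 7, 4], index=['M5B', 'M5C', 'M4X'], name='Venue')
competitors = pd.Series([2], index=['M5B'], name='Venue')

counts = venues.to_frame()
# Assigning a Series aligns on the index; unmatched postal codes get NaN
counts['Number of Competitors'] = competitors
counts['Number of Competitors'] = counts['Number of Competitors'].fillna(0).astype(int)
print(counts)  # M5B -> 2 competitors, M5C and M4X -> 0
```

This is why `fillna(0)` is needed before casting to `int`: the cast would fail on NaN values.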
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Postal Code'] = toronto_venues['Postal Code']
# add Postal Code as first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
# drop each of competitors to clean dataframe
toronto_onehot.drop(competitors, axis=1, inplace=True)
print(toronto_onehot.shape)
# group rows by mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
# set index and add columns about competitors
toronto_grouped.set_index('Postal Code', inplace=True)
toronto_grouped['Number of Competitors'] = toronto_venues_count['Number of Competitors']
toronto_grouped['Percent of Competitors'] = toronto_venues_count['Percent of Competitors']
print(toronto_grouped.shape)
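What the one-hot mean represents: for each postal code, each value is the share of that postal code's venues falling in the given category. A toy example with made-up venues:

```python
import pandas as pd

# Toy venue list: two postal codes, three venues
venues = pd.DataFrame({
    'Postal Code': ['M5B', 'M5B', 'M5C'],
    'Venue Category': ['Coffee Shop', 'Park', 'Coffee Shop'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='')
onehot['Postal Code'] = venues['Postal Code']

# Mean of the one-hot columns = frequency of each category per postal code
grouped = onehot.groupby('Postal Code').mean()
print(grouped)
# M5B: Coffee Shop 0.5, Park 0.5; M5C: Coffee Shop 1.0, Park 0.0
```

These per-category frequencies are what K-Means will later cluster on.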
# function to sort a row's categories in descending order of frequency
def return_most_common_venues(row, num_top_venues):
    # Postal Code is the index here, so every entry of the row is a category
    row_categories_sorted = row.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
# making dataframe with top-5 venues for each postal code
num_top_venues = 5
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped.index
for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        toronto_grouped.drop(['Percent of Competitors', 'Number of Competitors'], axis=1).iloc[ind, :],
        num_top_venues)
neighborhoods_venues_sorted.head()
import plotly.express as px
import plotly
plotly.offline.init_notebook_mode()
To choose the number of clusters, let's use the 'elbow method': plot the within-cluster distortion for a range of k and pick the point where the curve bends.
X = toronto_grouped.copy()
# calculate distortion for a range of cluster counts
distortions = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X)
    distortions.append(km.inertia_)
# plot
df_distortions = pd.DataFrame(distortions, columns=['distortions'])
df_distortions['clusters'] = range(1,11)
figure_5 = px.line(df_distortions, x="clusters", y="distortions")
figure_5.show()
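Reading the elbow off the plot is subjective; one rough heuristic (not part of scikit-learn, just an illustrative sketch) is to pick the k with the largest second difference of the distortion curve, i.e. the sharpest bend:

```python
import numpy as np

def elbow_k(distortions, k_min=1):
    """Pick k at the largest second difference (sharpest bend) of the curve.

    distortions[0] is assumed to correspond to k = k_min.
    """
    d = np.asarray(distortions, dtype=float)
    second_diff = d[:-2] - 2 * d[1:-1] + d[2:]  # discrete curvature proxy
    return k_min + 1 + int(np.argmax(second_diff))

# Synthetic distortion curve for k = 1..6 with an obvious bend at k = 3
curve = [100.0, 60.0, 30.0, 28.0, 27.0, 26.5]
print(elbow_k(curve))  # → 3
```

On noisy curves this heuristic can misfire, so the visual check against the plot above is still worth doing.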
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.copy()
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = table.copy()
# merge the sorted-venues table into the venue data to add cluster labels and top venues
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
# cleaning and adjusting dataframe
toronto_merged = toronto_merged.dropna()
toronto_merged.reset_index(inplace=True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
toronto_merged.drop(['index'], axis=1, inplace=True)
print(toronto_merged.shape)
toronto_merged.head()
# create map
map_venue_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map, coloured by cluster label (labels 0..kclusters-1)
for lat, lon, poi, cluster in zip(toronto_merged['Venue Latitude'],
                                  toronto_merged['Venue Longitude'],
                                  toronto_merged['Venue Category'],
                                  toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_venue_clusters)
map_venue_clusters
Let's find out whether the clusters differ in their total venue counts and competitor numbers.
# making dataframe with info about postal codes/clusters/the most popular venues
toronto_postal_codes_clustered = toronto_merged.drop_duplicates(subset=['Postal Code'])
toronto_postal_codes_clustered = toronto_postal_codes_clustered.reset_index(drop=True)
toronto_postal_codes_clustered.set_index('Postal Code', inplace=True)
toronto_postal_codes_clustered.drop(['Venue', 'Venue Category', 'Venue Latitude', 'Venue Longitude'], axis=1, inplace=True)
toronto_postal_codes_clustered['Number of Competitors'] = toronto_venues_count['Number of Competitors']
toronto_postal_codes_clustered['Percent of Competitors'] = toronto_venues_count['Percent of Competitors']
toronto_postal_codes_clustered['Venues'] = toronto_venues_count['Venue']
print(toronto_postal_codes_clustered.shape)
toronto_postal_codes_clustered.head()
# making table with total calculations for all clusters
# groupby sorts by label, so sort the cluster ids to keep the rows aligned
clusters_total_calc = {'Cluster': sorted(toronto_postal_codes_clustered['Cluster Labels'].unique()),
                       'Num of Competitors': list(toronto_postal_codes_clustered.groupby('Cluster Labels')['Number of Competitors'].sum()),
                       'Venues Total': list(toronto_postal_codes_clustered.groupby('Cluster Labels')['Venues'].sum()),
                       'Percent of Competitors': list(toronto_postal_codes_clustered.groupby('Cluster Labels')['Percent of Competitors'].mean())
                       }
postal_codes_clusters_total = pd.DataFrame(data=clusters_total_calc)
postal_codes_clusters_total
As we can see, the top-3 clusters by total venues are Cluster 0, Cluster 2, and Cluster 3, and Cluster 2 has both the highest venue total and the highest number of competitors.
toronto_postal_codes_cluster_2 = toronto_postal_codes_clustered.loc[(toronto_postal_codes_clustered['Cluster Labels'] == 2) &
(toronto_postal_codes_clustered['Venues'] > 99)]
toronto_postal_codes_cluster_2 = toronto_postal_codes_cluster_2.sort_values(by='Number of Competitors', ascending=False)
# filtering postal codes with lowest competitors number in Cluster 2
toronto_postal_codes_cluster_2_potential = toronto_postal_codes_cluster_2.loc[toronto_postal_codes_cluster_2['Number of Competitors'] < 6]
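The hard thresholds used above (more than 99 venues, fewer than 6 competitors) are one reasonable choice; an alternative sketch ranks postal codes by a simple venues-per-competitor score instead. The numbers below are made up for illustration:

```python
import pandas as pd

# Toy postal-code summary; real values come from toronto_postal_codes_clustered
df = pd.DataFrame({
    'Venues': [120, 150, 80],
    'Number of Competitors': [3, 12, 1],
}, index=['M5B', 'M5C', 'M4X'])

# High traffic with few competitors -> high score; +1 avoids division by zero
df['Score'] = df['Venues'] / (df['Number of Competitors'] + 1)
ranked = df.sort_values(by='Score', ascending=False)
print(ranked.index.tolist())  # postal codes from most to least attractive
```

A ranking avoids the cliff effect of fixed cut-offs (a postal code with 99 venues is excluded, one with 100 is not), at the cost of having to choose how to weight traffic against competition.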
# making dataframe with info about venues/clusters
toronto_venues_clustered = toronto_merged.copy()
# toronto_venues_clustered = toronto_venues_clustered.reset_index(drop=True)
toronto_venues_clustered.drop(['1st Most Common Venue',
'2nd Most Common Venue',
'3rd Most Common Venue',
'4th Most Common Venue',
'5th Most Common Venue'], axis=1, inplace=True)
print(toronto_venues_clustered.shape)
toronto_venues_clustered.head()
# filtering out only Cluster 2
toronto_venues_cluster_2 = toronto_venues_clustered.loc[toronto_venues_clustered['Cluster Labels'] == 2]
toronto_venues_cluster_2.shape
# create map
map_cluster_2 = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=13)
# set color scheme for the clusters (colormap input must lie in [0, 1])
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add Cluster 2 venues to the map
for lat_c2, lng_c2, category_c2, venue_c2 in zip(toronto_venues_cluster_2['Venue Latitude'],
                                                 toronto_venues_cluster_2['Venue Longitude'],
                                                 toronto_venues_cluster_2['Venue Category'],
                                                 toronto_venues_cluster_2['Venue']):
    label_c2 = folium.Popup('{}, {}'.format(category_c2, venue_c2), parse_html=True)
    folium.CircleMarker(
        [lat_c2, lng_c2],
        radius=4,
        popup=label_c2,
        color=rainbow[0],
        fill=True,
        fill_color=rainbow[0],
        fill_opacity=0.7).add_to(map_cluster_2)
# adding potential postal codes to the map
for lat_p, lng_p, postal_code in zip(toronto_postal_codes_cluster_2_potential['Postal Code Latitude'],
                                     toronto_postal_codes_cluster_2_potential['Postal Code Longitude'],
                                     toronto_postal_codes_cluster_2_potential.index):
    label_p = folium.Popup('Postal Code: {}'.format(postal_code), parse_html=True)
    folium.Marker(
        [lat_p, lng_p],
        popup=label_p).add_to(map_cluster_2)
# adding competitors to the map
for lat_comp, lng_comp, category_comp, venue_comp in zip(toronto_venues_competitors['Venue Latitude'],
                                                         toronto_venues_competitors['Venue Longitude'],
                                                         toronto_venues_competitors['Venue Category'],
                                                         toronto_venues_competitors['Venue']):
    label_comp = folium.Popup('{}, {}'.format(category_comp, venue_comp), parse_html=True)
    folium.CircleMarker(
        [lat_comp, lng_comp],
        radius=10,
        popup=label_comp,
        color=rainbow[1],
        fill=True,
        fill_color=rainbow[1],
        fill_opacity=0.7).add_to(map_cluster_2)
map_cluster_2
As we can see from the map, the two postal codes M5B and M5C sit in an area with a high number of venues (so there is good foot traffic) and relatively few competitors nearby. Let's find out which boroughs these are!
# import JSON file with Toronto boroughs from previous task
toronto_boroughs = pd.read_json(r'toronto_boroughs.json')
# looking for the most promising boroughs by postal code
toronto_boroughs.loc[(toronto_boroughs['Postal Code'] == 'M5B') | (toronto_boroughs['Postal Code'] == 'M5C')]
A lot happened, so let's sum up.
First of all, we made a list of all venue categories and manually filtered out the competitors. The next step was to count the number of competitors, and their percentage of the total venue count, for every postal code.
To transform the text data we used one-hot encoding. That was a preliminary step for the unsupervised machine learning technique K-Means clustering. As we didn't know the number of clusters in advance, we used the elbow method, then fed the chosen number of clusters into the model.
Clustering helped us pick out the areas with the highest potential, and we filtered out the areas most saturated with competitors.
The last step was to add three layers to the map: venues, competitors, and the preferred postal codes, so we can easily see the postal codes we need.
Thank you!